ikawrakow
Owner

This PR corresponds to PRs #531, #533, #534, #546, #549, #550, #552, and applies the on-the-fly repacking technique to
the 1-bit quants IQ1_S and IQ1_M on ARM_NEON.

Here is a PP-512 (prompt processing, 512 tokens) performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on an M2-Max:

| type | t/s (main) | t/s (PR) | Speedup |
|---|---:|---:|---:|
| IQ1_S | 66.3 | 168.8 | 2.546 |
| IQ1_M | 19.0 | 163.9 | 8.626 |
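
For reference, the Speedup column is just the ratio of the PR throughput to the main-branch throughput. The snippet below is a minimal sketch that recomputes it from the figures in the table; the names `results` and `speedup` are made up for this illustration and are not part of the code base.

```python
# PP-512 throughput in tokens/second, copied from the table above.
# Each entry is (t/s on main, t/s with this PR).
results = {
    "IQ1_S": (66.3, 168.8),
    "IQ1_M": (19.0, 163.9),
}


def speedup(main_tps: float, pr_tps: float) -> float:
    """Ratio of PR throughput to main-branch throughput (hypothetical helper)."""
    return pr_tps / main_tps


for quant, (main_tps, pr_tps) in results.items():
    print(f"{quant}: {speedup(main_tps, pr_tps):.3f}x")
```

Rounded to three decimals, this reproduces the 2.546 and 8.626 values reported above.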

IQ1_M did not previously have a fast IQK implementation, so the 19 t/s on main is what one gets from the standard ggml GEMM framework.

Iwan Kawrakow added 2 commits June 24, 2025 13:27
66.3 t/s -> 168.8 t/s.
19 t/s -> 163 t/s.
@ikawrakow ikawrakow merged commit b5f2f00 into main Jun 24, 2025